ABSTRACT Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in 
nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and 
the number of institutions producing such data. However, there are currently no common standards used to convey the quality, 
and therefore utility, of these various genome sequences. Here, we propose five "standard" categories that encompass all stages 
of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We 
also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade- 
offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of ge- 
nome quality across different research groups, sequencing platforms, and assembly techniques. 



Viruses represent the greatest source of biological diversity on 
Earth, and with the help of high-throughput (HT) sequencing 
technologies, great strides are being made toward the genomic 
characterization of this diversity (1-3). Genome sequences play a 
critical role in our understanding of viral evolution, disease epi- 
demiology, surveillance, diagnosis, and countermeasure develop- 
ment and thus represent valuable resources which must be prop- 
erly documented and curated to ensure future utility. Here, we 
outline a set of viral genome quality standards, similar in concept 
to those proposed for large DNA genomes (4) but focused on the 
particular challenges of and needs for research on small RNA/ 
DNA viruses, including characterization of the genomic diversity 
inherent in all viral samples/populations. Our goal is to define a 
common vocabulary that will allow comparison of genome qual- 
ity across different research groups, sequencing platforms, and 
assembly techniques. 

Despite the small sizes of viral genomes, complications related 
to limited RNA quantities, host "contamination," and secondary 
structure mean that it is often not time- or cost-effective to finish 
every genome, and given the intended use, finishing may be un- 
necessary (5). Therefore, we have used technology-agnostic crite- 
ria to define five standard categories designed to encompass the 
levels of completeness most often encountered in viral sequencing 
projects. Each viral family/species comes with its own challenges 
(e.g., secondary structure and GC content); therefore, we provide 
only loose guidance on the depth of sequence coverage likely re- 
quired to obtain different levels of finishing. In reality, a similar 
amount of data will generate genomes with different levels of fin- 
ishing for different viruses. 

To alleviate any reliance on particular aspects of the different se- 
quencing technologies, we have made two assumptions that should 
be valid in most viral sequencing projects. The first assumption is a
basic understanding of the genomic structure of the virus being
sequenced, including the expected size of the genome, the number
of segments, and the number and distribution of major open reading
frames (ORFs). Fortunately, genome structure is highly conserved
within viral groups (6), and although new viruses are constantly
being uncovered, the discovery of a novel family or even
genus remains relatively uncommon (7). In the absence of such 
information, the defined standards can still be applied following 
further analysis to determine genome structure. The second as- 
sumption is that the genetic material of the virus being described 
can be accurately separated from the genomes of the host and/or 
other microbes, either physically or bioinformatically. Depending 
on the technology used, it is critical that the potential for cross- 
contamination of samples during the sample indexing/bar coding 
process and sequencing procedure be addressed with appropriate 
internal controls and procedural methods (8). 

PROPOSED CATEGORIES FOR WHOLE-GENOME 
SEQUENCING OF VIRUSES 

For a summary of the proposed categories for whole-genome se- 
quencing of viruses, see Fig. 1 and Table 1. 
FIG 1 Graphical representation of viral genome standards. Bullets on the left represent primary distinctions between categories. Bullets on the right indicate 
potential downstream applications of genomes in each category. 



Standard draft (SD). The "standard draft" category is for 
whole shotgun genome assemblies with coverage that is low 
and/or uneven enough to prevent the assembly of a single contig 
for ≥1 genome segments. Genomes in this category are likely to
result from samples with low viral titers, such as clinical and en- 
vironmental samples, or to be those containing regions that are 
difficult to sequence across (e.g., intergenic hairpin regions) (9). 
To distinguish standard drafts from targeted amplification of par- 
tial viral sequences, standard drafts should contain at least 1 contig 
for each genomic segment and should be prepared in a manner 
that allows the possibility of sequencing the vast majority of a 
virus's genome. To avoid the inclusion of small pieces of genomes 
as "drafts," there needs to be some type of minimum cutoff for 
breadth of coverage. Therefore, we suggest that at least a majority 
(>50%) of the genome be present for a set of sequences to be 
considered a draft genome. 
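
As a minimal sketch of the breadth-of-coverage cutoff suggested above, the following Python computes the fraction of a genome represented by a set of contigs and compares it with the >50% threshold. The interval representation (0-based, half-open contig coordinates placed on a genome of expected length) and the function names are assumptions made for illustration; they are not part of the proposed standard.

```python
def breadth_of_coverage(contig_intervals, genome_length):
    """Return the fraction of genome positions covered by at least one contig.

    contig_intervals: iterable of (start, end) pairs, 0-based and half-open,
    placed on the expected genome (an assumed representation).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    furthest = 0  # right-most position already counted
    for start, end in sorted(contig_intervals):
        start = max(start, furthest)    # skip positions counted earlier
        end = min(end, genome_length)   # ignore overhang past the genome end
        if end > start:
            covered += end - start
            furthest = end
    return covered / genome_length


def meets_draft_breadth(contig_intervals, genome_length, cutoff=0.5):
    """True if strictly more than `cutoff` of the genome is represented."""
    return breadth_of_coverage(contig_intervals, genome_length) > cutoff
```

For a segmented virus, one plausible use is to pool the contig intervals and expected lengths of all segments before applying the cutoff, since the text states the threshold for the genome as a whole.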



High quality (HQ). Genomes should be considered high qual- 
ity if no gaps remain (i.e., a single contig per genome/segment), 
even if one or more ORFs remain incomplete due to missing se- 
quence at the ends of segments. An HQ genome can often be 
achieved with modest levels of HT sequencing coverage (~15 to
30×) or through Sanger-mediated gap resolution of an SD.

Coding complete (CC). The "coding complete" category indi- 
cates that in addition to the lack of gaps, all ORFs are complete. 
This level of completion is typically possible with high levels of HT 
sequencing coverage (>100×) or may require the use of conserved
PCR primers targeting the ends of the segments.

Complete. A genome is complete when the genome sequence 
has been fully resolved, including all non-protein-coding se- 
quences at the ends of the segment(s). This is typically achieved 
through rapid amplification of cDNA ends (RACE) or similar 
procedures. 



TABLE 1 Overview of viral genome standards 
a It is suggested that all bases included in any incomplete genome meet a minimum quality standard, with ≥5 reads supporting the consensus base call with individual base qualities 
of ≥20 on the Phred scale. 

b Percentages of genome covered are not meant to serve as criteria for categorizing a genome; they are simply estimates of expected levels of coverage. 
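
The minimum base-quality suggestion in footnote a can be expressed as a simple per-position check. The sketch below assumes a hypothetical pileup representation, a list of (base, Phred quality) pairs for the reads overlapping one position, and counts only reads that both agree with the consensus call and have a base quality of at least 20. The function name and defaults are illustrative.

```python
def base_meets_minimum(consensus_base, pileup, min_reads=5, min_phred=20):
    """Check one consensus position against the suggested minimum standard.

    pileup: list of (base, phred_quality) pairs, one per overlapping read
    (an assumed representation of the read data at this position).
    """
    supporting = sum(
        1
        for base, quality in pileup
        if base == consensus_base and quality >= min_phred
    )
    return supporting >= min_reads
```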

Finished. This final category represents a special instance in 
which, in addition to having a completed consensus genome se- 
quence, there has been a population-level characterization of 
genomic diversity. Typically this requires ~400 to 1,000× coverage
(see below). This provides the most complete picture of a viral
population; however, this designation will apply only for a single 
stock. Additional characterizations will be necessary for future 
passages. 
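
The five categories form an ordered series, each adding one criterion to the previous one. The following sketch shows one way the definitions might be encoded. The function name, argument names, and the choice to return `None` for sequence sets that do not qualify as a draft are assumptions made for illustration, not part of the proposal.

```python
def classify_genome(contigs_per_segment, breadth, orfs_complete,
                    ends_resolved, population_characterized):
    """Assign one of the five proposed categories, or None if not a draft.

    contigs_per_segment: number of contigs assembled for each genome segment
    breadth: fraction of the genome represented (0.0 to 1.0)
    orfs_complete: True if all ORFs are complete
    ends_resolved: True if noncoding segment ends are fully resolved
    population_characterized: True if population-level diversity was characterized
    """
    # A draft needs at least one contig per segment and >50% of the genome.
    if not contigs_per_segment or min(contigs_per_segment) < 1 or breadth <= 0.5:
        return None
    # Gaps remain in at least one segment.
    if max(contigs_per_segment) > 1:
        return "standard draft"
    if not orfs_complete:
        return "high quality"
    if not ends_resolved:
        return "coding complete"
    if not population_characterized:
        return "complete"
    return "finished"
```

Evaluating the criteria in this order mirrors the text: a genome with gaps is a standard draft regardless of ORF status, and the finished designation requires a complete consensus sequence first.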

ADDITIONAL HIGH-THROUGHPUT SEQUENCE-BASED 
GENOME CHARACTERIZATIONS 

Population-level characterization. HT sequencing technologies 
provide powerful platforms for investigating the genetic diversity 
within viral populations, which is integral to our understanding of 
viral evolution and pathogenesis (10, 11). Population-level char- 
acterization requires very high levels of HT sequencing coverage 
(12, 13); however, the exact level will depend on the background 
error profiles of the sequencing technology and the desired level of 
sensitivity. As an example, Wang et al. (12) determined that for 
pyrosequencing data, ~400× coverage is necessary to identify minor
variants present at 1% frequency with 99.999% confidence,
and ~1,000× coverage is needed for variants with a frequency of
0.5%. Targeted amplification of the viral genome is often neces- 
sary to achieve these coverage requirements. Due to the modest 
sequence lengths of most HT technologies, the state of the art for 
population-level analysis has been the characterization of un- 
phased polymorphisms. However, single-molecule technologies, 
with maximum read lengths of >20 kb, are opening the door for 
complete genome haplotype phasing (14). 
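
To illustrate why the required coverage rises as the target variant frequency falls, the sketch below uses a plain binomial sampling model: each read at a position independently carries the variant with probability equal to its frequency. This is a deliberate simplification that ignores sequencing error, so it is not the model of Wang et al. (12) and should not be expected to reproduce their coverage figures. The function and parameter names are illustrative.

```python
import math


def detection_probability(depth, variant_freq, min_variant_reads=1):
    """Probability of seeing at least `min_variant_reads` variant reads.

    Assumes reads sample the population independently (binomial model)
    and that sequencing errors are negligible, which is a simplification.
    """
    if not 0.0 <= variant_freq <= 1.0:
        raise ValueError("variant_freq must be between 0 and 1")
    # Sum the probabilities of observing fewer than the required reads.
    p_fewer = sum(
        math.comb(depth, i) * variant_freq**i * (1 - variant_freq)**(depth - i)
        for i in range(min_variant_reads)
    )
    return 1.0 - p_fewer
```

Under this model, halving the variant frequency requires roughly doubling the depth to keep the same detection probability. A real analysis would also need to separate true variants from platform-specific errors, which raises the required coverage further.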

Identification of contaminants or adventitious agents. After 
isolation, viruses are often maintained as stocks, which are prop- 
agated within host cells in tissue culture and thus amplified and 
preserved for future use. Despite careful laboratory practices, it is 
possible for these stocks to become contaminated with additional 
microbes. Contaminating microbes are often detrimental to sub- 
sequent applications such as vaccine development or the testing of 
therapeutics, making it imperative to monitor the purity of viral 
stocks. HT sequencing provides a powerful method not only for
detecting the presence of contaminants within a sample but also
for identifying and characterizing them. The
level of sequencing required for contamination analysis is depen- 
dent on the desired sensitivity, with more sequencing required to 
ensure detection of contaminants present at very low levels. For 
most approaches, HQ-level sequencing should be sufficient. De- 
pending on the intended applications, analysis may need to be 
repeated after further passaging to ensure that no additional con- 
taminants have been introduced. 
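
The relationship between sensitivity and sequencing effort can be sketched with a simple calculation. Assuming that each read is drawn independently and that a contaminant contributes a fixed fraction of all reads (both simplifying assumptions), the number of reads needed to observe at least one contaminant read with a given confidence follows from 1 - (1 - p)^N ≥ confidence. The function name and default confidence are illustrative.

```python
import math


def reads_needed(contaminant_fraction, confidence=0.95):
    """Reads required to see >= 1 contaminant read with the given confidence.

    Assumes independent reads and a fixed contaminant read fraction;
    a single read is rarely sufficient evidence in practice.
    """
    if not 0.0 < contaminant_fraction < 1.0:
        raise ValueError("contaminant_fraction must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - contaminant_fraction)
    )
```

Because the required read count scales roughly inversely with the contaminant fraction, a tenfold increase in sensitivity costs about ten times as much sequencing.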

RECOMMENDED STANDARDS FOR DOWNSTREAM 
APPLICATIONS 

Description of novel viruses. Despite the rapidly growing collec- 
tion of viral sequences, the description of novel viruses is likely to 
remain an important aspect of viral genome sequencing (7, 15, 
16). This is true in part because viruses evolve rapidly and are 
capable of recombining to form novel genotypes (17, 18). It is also 
true that most of the viruses that are currently circulating remain 
uncharacterized (15). Particularly lacking are representatives 
from groups that are not currently known to infect humans or 
organisms of economic importance. It would be imprudent, however,
to continue to ignore these uncharacterized reservoirs of
diversity, because it is difficult to predict the source of future
emerging diseases (19-21). Additionally, with the current suite of 
primarily sequence similarity-based pathogen identification tools, 
the ability to detect novel pathogens is wholly dependent on high- 
quality reference databases (22). There is a trend toward requiring 
a complete genome sequence when a description of a novel virus is 
being published, and we agree that this is a good goal; however, the
time and resources required to complete the last 1 to 2% of a viral
genome are often prohibitive for projects
sequencing a large number of samples, and in most cases the very
ends of the segments are not essential for proper identification and 
characterization. Therefore, for the majority of viral characteriza- 
tion projects, we recommend, at a minimum, a CC genome. This 
will ensure a complete description of the viral proteome and will 
allow accurate phylogenetic placement. 

Molecular epidemiology. One of the most common and im- 
portant applications for viral genomes is in the study of viral epi- 
demiology, which encompasses our understanding of the pat- 
terns, causes, and effects of disease. Early studies of molecular 
epidemiology targeted small pieces of viral genomes; however, this 
type of analysis is likely to miss important changes elsewhere in the 
genome. Therefore, there has been a strong focus in recent years
on the sequencing of "full" viral genomes. Institutes such as
the Broad Institute and the J. Craig Venter Institute (JCVI) have 
been instrumental in breaking ground in the collection of large 
numbers of good-quality viral sequences. Their newly identified 
genomes typically fall within our CC category. This is likely to 
remain the gold standard for studies involving a large number of 
genome sequences, especially when some samples come from low- 
titer clinical samples, often necessitating amplicon-based se- 
quencing methods. CC genomes allow for interrogation of 
changes throughout the coding portion of the viral genome and 
often include partial noncoding regions. In the absence of high- 
throughput RACE alternatives, the time and resources required to 
complete hundreds or thousands of genomes are likely to con- 
tinue to outweigh the potential information gained from complet- 
ing the terminal sequences. 

Countermeasure development. Advancements in our capa- 
bilities to sequence viral genomes are changing the way we coun- 
teract global pandemics and acts of bioterrorism. There are two 
important aspects of countermeasure development that can ben- 
efit strongly from the availability of genome sequences and HT 
sequencing data: the detection of the infectious agent and the 
treatment of the disease caused by the agent. Taxonomic classifi- 
cation and detection through DNA/RNA-based inclusivity assays 
(i.e., using techniques such as PCR to detect the presence of a 
pathogen) can be designed using fragmented and incomplete ge- 
nomes (e.g., SD and HQ sequences). Fully resolved ORFs (CC) 
further enable the development of immunological assays, such as 
enzyme-linked immunosorbent assays (ELISA) and immunoflu- 
orescence assays (IFA), for protein-based detection, and obtaining 
a complete genome opens the door to a plethora of additional 
downstream applications, including the design of exclusivity tests, 
the establishment of reverse genetics systems, and the design of 
robust forensics protocols. However, for effective development 
and testing of animal models, therapeutics, vaccines, and prophy- 
lactics, it is necessary to obtain a complete picture of the variability 
present within both the challenge stock and postinfection populations,
thereby necessitating finished genomes. In these medical
applications, it is also important to demonstrate the absence of
adventitious agents. 
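
The recommendations in this section can be summarized as a lookup from application to the lowest category that supports it. The sketch below is one possible encoding. The application labels, the dictionary, and the function name are shorthand introduced here for illustration, and the mapping reflects only the minimum levels named in the text above.

```python
# Categories in order of increasing completeness.
CATEGORY_ORDER = [
    "standard draft",
    "high quality",
    "coding complete",
    "complete",
    "finished",
]

# Minimum category per application, as read from the text (labels are illustrative).
RECOMMENDED_MINIMUM = {
    "nucleic acid detection assay": "standard draft",
    "novel virus description": "coding complete",
    "molecular epidemiology": "coding complete",
    "immunological assay": "coding complete",
    "exclusivity assay": "complete",
    "reverse genetics": "complete",
    "forensics": "complete",
    "therapeutic or vaccine testing": "finished",
}


def is_sufficient(category, application):
    """True if `category` meets or exceeds the recommended minimum."""
    required = RECOMMENDED_MINIMUM[application]
    return CATEGORY_ORDER.index(category) >= CATEGORY_ORDER.index(required)
```

For example, `is_sufficient("high quality", "molecular epidemiology")` evaluates to `False`, because the text recommends coding complete genomes for that application.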

REPOSITORIES OF GENOMIC INFORMATION AND DATA 
CURATION 

In addition to standardizing the vocabulary of viral genome as- 
semblies, it is also critical for researchers to routinely provide raw 
sequencing reads. Without these, it is impossible for others to 
independently verify the quality of an assembly. Data repositories 
such as GenBank already provide a platform for depositing HT 
sequencing reads, but this is not a requirement for the submission 
of a genome, nor is this option typically utilized. Wider analysis of 
data will ultimately result in higher-quality assemblies. It is worth 
considering broader implementation of a wiki-like, crowdsourcing
strategy for genome assembly, similar to the annotation
strategies that have been adopted for specific genomes of high 
interest (23, 24). This approach would allow multiple parties to 
work on genome assembly and annotation at the same time and 
would provide instant updates for the entire community to eval- 
uate and utilize in their own research. 

Our primary goal here is to initiate a conversation. The rate at 
which viral genomes are being sequenced is only going to increase 
in the coming years, and without some standardization, it will be 
impossible for these valuable resources to be utilized to their full 
potential. We present these categories as a starting point, with the 
goal of adjusting and refining them over time as our capabilities 
and needs continue to change. 